Intro

# install.packages('ggplot2')
library(ggplot2)
# install.packages('ggthemes')
library(ggthemes)
# install.packages('car')
library('car')
# install.packages("ggcorrplot")
library(ggcorrplot)
library(gridExtra)

setwd("/Users/gdimino/r_stuff")

reds <- read.csv("wineQualityReds.csv")
whites <- read.csv("wineQualityWhites.csv")

# Create combined dataset

temp_reds <- reds
temp_whites <- whites
temp_reds['type'] <- "red"
temp_whites['type'] <- "white"

# Stack the two labeled copies into one data frame
all <- rbind(temp_reds, temp_whites)

Examining the red wine data

head(reds)
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
dim(reds)
## [1] 1599   13

Univariate analysis

Quality histogram

The first thing is to look at the distribution of wines vs. quality. We see a roughly normal distribution, perhaps slightly skewed. In particular, there are very few wines in the highest quality bins. This makes sense, since high-quality wines are relatively rare and difficult to create.
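The plotting code for this histogram is not shown in the report. A minimal sketch with ggplot2 might look like the following. The function name `quality_hist` is hypothetical, and the small `first_six` data frame only holds the quality values of the six rows printed by `head(reds)` so that the snippet runs on its own. With the real data, pass `reds` instead.

```r
library(ggplot2)

# Bar chart of the number of wines at each quality score.
# quality is an integer rating, so it is treated as a discrete variable.
quality_hist <- function(df) {
  ggplot(df, aes(x = factor(quality))) +
    geom_bar() +
    labs(x = "Quality", y = "Number of wines")
}

# Quality values of the six rows shown by head(reds); use `reds` in practice.
first_six <- data.frame(quality = c(5, 5, 5, 6, 5, 5))
p <- quality_hist(first_six)
counts <- table(first_six$quality)
```

Printing `p` draws the chart, and `counts` gives the same information as a table.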

Feature boxplots

Next, we’ll look at the distributions for each of the chemical properties of wine, as sorted by quality. This should give us a good idea of how each property varies with wine quality.
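The boxplot code is also not shown. The sketch below is one way it could be done, under the assumption that each plot is a ggplot2 boxplot of one property grouped by quality. `quality_boxplot` and `median_by_quality` are hypothetical helper names, and `first_six` again holds only the six rows printed by `head(reds)`.

```r
library(ggplot2)

# One boxplot per quality bin for a single property (column name given as a string).
quality_boxplot <- function(df, prop) {
  ggplot(df, aes(x = factor(quality), y = .data[[prop]])) +
    geom_boxplot() +
    labs(x = "Quality", y = prop)
}

# Median of the property within each quality bin.
median_by_quality <- function(df, prop) {
  tapply(df[[prop]], df$quality, median)
}

# Alcohol and quality for the six rows shown by head(reds); use `reds` in practice.
first_six <- data.frame(quality = c(5, 5, 5, 6, 5, 5),
                        alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4))
p <- quality_boxplot(first_six, "alcohol")
meds <- median_by_quality(first_six, "alcohol")
```

Calling `median_by_quality(reds, "volatile.acidity")` and so on would give the per-bin medians that the comments in this section describe.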

Fixed acidity

The median of the fixed acidity increases with wine quality, though there are a number of outliers with large fixed acidity in the middle quality bins. Of course, there are a lot more samples in those bins.

Volatile acidity

Here, the relationship is pretty clear. The low quality wines have a high median volatile acidity, whereas high-quality wines have much less.
Volatile acidity measures acetic acid (vinegar) and other impurities, so this relationship makes sense.

Citric acid

Median citric acid increases with wine quality, although there seem to be a bunch of outliers in quality bin 7. Possible measurement error?

Residual sugar

Residual sugar seems to be roughly equivalent for the different quality bins, although far outliers in the middle bins make the scale hard to read.

Chlorides

There appears to be a trend of decreasing chlorides in the higher quality bins, but this is a bit obscured by a number of far outliers that affect the scale.

Free sulfur dioxide

The median is largest in the mid-quality wines and lower at each extreme. The outliers are less extreme than in the previous two plots.

Total sulfur dioxide

Total sulfur dioxide has a similar profile to free sulfur dioxide. Might be good to look for a correlation here.

Density

Median density declines with wine quality, especially in the highest bins.

pH

Median pH also declines with wine quality, but the ranges are similar.

Sulphates

Just when you were getting bored, here’s another clear relationship. Median sulphates increase with increasing quality.

Alcohol

Low quality wines have a relatively low alcohol content, but this goes up in bins 6 and above. A low alcohol level could be a symptom of wine going to vinegar (although there could be other factors). It would be good to check for a correlation with volatile acidity.

Correlations of individual properties w/ quality

As a final step in our univariate analysis, let’s look at how strongly each property is correlated to wine quality:

##        fixed.acidity     volatile.acidity          citric.acid 
##           0.12405165          -0.39055778           0.22637251 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##           0.01373164          -0.12890656          -0.05065606 
## total.sulfur.dioxide              density                   pH 
##          -0.18510029          -0.17491923          -0.05773139 
##            sulphates              alcohol              quality 
##           0.25139708           0.47616632           1.00000000

This shows the strongest positive correlation to alcohol content, and the strongest negative correlation to volatile acidity, which makes sense.

Other factors with relatively high correlations to quality (+ or -) are sulphates and citric acid.
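The code behind this table is not shown. A sketch of one way to get it, assuming Pearson correlation and assuming the row-index column `X` should be excluded, is below. `cor_with_quality` is a hypothetical helper, and `first_six` holds three columns from the six rows printed by `head(reds)`, which is far too few to be meaningful and only keeps the snippet self-contained.

```r
# Pearson correlation of every numeric column with quality.
# Column X is just the row index from the CSV, so it is dropped first.
cor_with_quality <- function(df) {
  num <- df[sapply(df, is.numeric)]
  num <- num[setdiff(names(num), "X")]
  cor(num, num$quality)[, 1]
}

# Three columns from the six rows shown by head(reds); use `reds` in practice.
first_six <- data.frame(
  X = 1:6,
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5, 5, 5, 6, 5, 5)
)
r <- cor_with_quality(first_six)

# Sort by absolute value so the strongest relationships come first.
r_sorted <- r[order(-abs(r))]
```

Sorting by absolute value puts the strongest positive and negative relationships side by side at the top.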

Multivariate Analysis

Now that we have looked at all of the properties individually and inspected how they vary across the quality bins, it’s time to look at how different combinations of properties might affect the quality of a wine.

Correlation heatmap of properties

The first thing that popped out at me is that many of these properties are related, just because of chemistry. So, I did a correlation heatmap to find variables that are not completely independent.

Here it is:

Some things to notice here:

  • Fixed acidity and pH appear to be related, likewise, citric acid and pH. (Chemistry!)
  • As I speculated above, free sulfur dioxide and total sulfur dioxide are related. Again, probably chemistry at work.
  • Density is inversely related to alcohol content as you would expect since alcohol is less dense than water. It is also related to fixed acidity, but I’m not sure that’s a causal relationship.

So, the conclusion here is that many of the properties give redundant information. Maybe we can decrease the dimensionality somehow.
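The related pairs can also be listed directly from the correlation matrix rather than read off the plot. The sketch below assumes a cutoff of 0.5 on the absolute correlation, which is an arbitrary value chosen for illustration. `strong_pairs` is a hypothetical helper, and `first_six` holds three columns from the six rows printed by `head(reds)`. The heatmap call is guarded so the snippet still runs if ggcorrplot is not installed.

```r
# List every pair of columns whose absolute correlation exceeds a threshold.
# Only the upper triangle is scanned so that each pair appears once.
strong_pairs <- function(df, threshold = 0.5) {
  corr <- cor(df)
  idx <- which(abs(corr) > threshold & upper.tri(corr), arr.ind = TRUE)
  data.frame(var1 = rownames(corr)[idx[, "row"]],
             var2 = colnames(corr)[idx[, "col"]],
             r = corr[idx])
}

# Three columns from the six rows shown by head(reds); use reds[2:12] in practice.
first_six <- data.frame(
  free.sulfur.dioxide = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4)
)
pairs_found <- strong_pairs(first_six)

# The heatmap itself, if ggcorrplot is available.
if (requireNamespace("ggcorrplot", quietly = TRUE)) {
  heat <- ggcorrplot::ggcorrplot(cor(first_six), type = "lower", lab = TRUE)
}
```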

Note: I’m sorry the graph is so scrunched in the PDF. It looks great on a large monitor.

Scatterplot matrix

The next step is to do a scatterplot matrix and see the relationships between each pair of properties.

Again, this is OK on a larger monitor but pretty terrible in PDF, so I’ve broken out the diagonal pieces:

Not sure if there’s a good way to blow up the off-diagonal part of the matrix, please advise!
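One possible approach, sketched here with base R's `pairs()`, is to plot only a few columns at a time so that each off-diagonal panel gets more room. `plot_subset` is a hypothetical helper and the choice of columns is only an example. `first_six` holds three columns from the six rows printed by `head(reds)`.

```r
# Scatterplot matrix restricted to a handful of columns,
# so each panel is large enough to read on a PDF page.
plot_subset <- function(df, cols) {
  pairs(df[cols], pch = 20, cex = 0.8)
  invisible(cols)
}

# Three columns from the six rows shown by head(reds); use `reds` in practice.
first_six <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  citric.acid = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  pH = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51)
)
shown <- plot_subset(first_six, c("fixed.acidity", "citric.acid", "pH"))
```

Calling it several times with different groups of related columns (the acids, the sulfur measures, and so on) would cover the interesting parts of the full matrix.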

Anyhow, this doesn’t show us too much more than we saw in the heatmap. Along the diagonal you can see that the distribution of most of the properties is roughly normal (Gaussian). Citric acid is a notable exception.

This may affect the ways we choose to analyze the data.

PCA for red wine

Since most of the properties are normally distributed, I decided to try a principal components analysis to reduce the dimensionality of the wine dataset.

I got some code from the interwebs for this one (can’t find the exact reference).

But the code and results seem reasonable, so here goes

The first step is to rescale the data (center each property to mean 0 and scale it to standard deviation 1), then run the PCA.

  wine <- reds

  s <- as.data.frame(scale(wine[2:12]))
  wine.pca <- prcomp(s) 
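As a quick check of what the rescaling does, the sketch below applies `scale()` to two columns from the first four rows printed by `head(reds)`. The names `first_four` and `s_check` are made up for this example. Density and alcohol sit on very different numeric ranges before scaling and on the same footing afterwards.

```r
# scale() subtracts each column's mean and divides by its standard deviation.
first_four <- data.frame(density = c(0.9978, 0.9968, 0.9970, 0.9980),
                         alcohol = c(9.4, 9.8, 9.8, 9.8))
s_check <- as.data.frame(scale(first_four))

col_means <- colMeans(s_check)    # all approximately 0
col_sds <- sapply(s_check, sd)    # all 1
```

Without this step, the properties with the largest raw variance would dominate the principal components.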

Here is a summary of the results:

  summary(wine.pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.7604 1.3878 1.2452 1.1015 0.97943 0.81216 0.76406
## Proportion of Variance 0.2817 0.1751 0.1410 0.1103 0.08721 0.05996 0.05307
## Cumulative Proportion  0.2817 0.4568 0.5978 0.7081 0.79528 0.85525 0.90832
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.65035 0.58706 0.42583 0.24405
## Proportion of Variance 0.03845 0.03133 0.01648 0.00541
## Cumulative Proportion  0.94677 0.97810 0.99459 1.00000

And a screeplot, which shows how much of the variance each of the new components accounts for.

  screeplot(wine.pca, type="lines")

There isn’t a real cutoff in the screeplot, but it is clear that the first 4-5 principal components account for most of the variance (about 71% to 80%, per the cumulative proportions above). So let’s have a look at them.
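The proportions in the summary follow directly from the standard deviations: each component's share is its variance divided by the total variance. The sketch below redoes that arithmetic with the standard deviations copied from the summary output. With the fitted object available, `wine.pca$sdev` could be used in place of the hard-coded vector. The 80% target is an arbitrary threshold picked for illustration.

```r
# Standard deviations of the 11 components, copied from summary(wine.pca).
sdev <- c(1.7604, 1.3878, 1.2452, 1.1015, 0.97943, 0.81216,
          0.76406, 0.65035, 0.58706, 0.42583, 0.24405)

# Proportion of variance = component variance / total variance.
prop_var <- sdev^2 / sum(sdev^2)
cum_var <- cumsum(prop_var)

# Smallest number of components whose cumulative share reaches 80%.
n_80 <- which(cum_var >= 0.80)[1]
```

Here `cum_var[5]` falls just short of 0.80, so `n_80` is 6, which is consistent with there being no sharp cutoff.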

First principal axis

The first PA seems to relate to general acidity. It has a weakly positive relationship to quality.

  wine.pca$rotation[,1]
##        fixed.acidity     volatile.acidity          citric.acid 
##           0.48931422          -0.23858436           0.46363166 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##           0.14610715           0.21224658          -0.03615752 
## total.sulfur.dioxide              density                   pH 
##           0.02357485           0.39535301          -0.43851962 
##            sulphates              alcohol 
##           0.24292133          -0.11323206
  first_pa <- wine.pca$x[, 1]
  scatterplot(first_pa ~ wine$quality, 
              xlab="Wine quality", 
              ylab="First PA", 
              main="Wine quality vs first PA (axis of acidity)", 
              labels=row.names(wine))

  lm(first_pa~wine$quality)
## 
## Call:
## lm(formula = first_pa ~ wine$quality)
## 
## Coefficients:
##  (Intercept)  wine$quality  
##      -1.3558        0.2406

Second principal axis

The second PA has high sulfur dioxide, high volatile acids and low alcohol (yuk). It falls off substantially in the higher-quality wines.

  wine.pca$rotation[,2]
##        fixed.acidity     volatile.acidity          citric.acid 
##         -0.110502738          0.274930480         -0.151791356 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##          0.272080238          0.148051555          0.513566812 
## total.sulfur.dioxide              density                   pH 
##          0.569486959          0.233575490          0.006710793 
##            sulphates              alcohol 
##         -0.037553916         -0.386180959
  second_pa <- wine.pca$x[, 2]
  scatterplot(second_pa ~ wine$quality, 
              xlab="Wine quality", 
              ylab="Second PA", 
              main="Wine quality vs second PA (axis of funk)", 
              labels=row.names(wine))

  lm(second_pa~wine$quality)
## 
## Call:
## lm(formula = second_pa ~ wine$quality)
## 
## Coefficients:
##  (Intercept)  wine$quality  
##       3.7463       -0.6647

Third principal axis

The third PA is characterized by high volatile acidity and low alcohol. It also has low sulfur dioxide, although the meaning of this is less clear. It is inversely related to wine quality. Basically, vinegar.

  third_pa <- wine.pca$x[, 3]
  wine.pca$rotation[,3]
##        fixed.acidity     volatile.acidity          citric.acid 
##           0.12330157           0.44996253          -0.23824707 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##          -0.10128338           0.09261383          -0.42879287 
## total.sulfur.dioxide              density                   pH 
##          -0.32241450           0.33887135          -0.05769735 
##            sulphates              alcohol 
##          -0.27978615          -0.47167322
  lm(third_pa~wine$quality)
## 
## Call:
## lm(formula = third_pa ~ wine$quality)
## 
## Coefficients:
##  (Intercept)  wine$quality  
##       3.4698       -0.6156
  scatterplot(third_pa ~ wine$quality, 
              xlab="Wine quality", 
              ylab="Third PA ", 
              main="Wine quality vs third PA (axis of vinegar)")

Fourth principal axis

The fourth PA is very boring. It loads mostly on chlorides and sulphates, and has no noticeable effect on quality.

  fourth_pa <- wine.pca$x[, 4]
  wine.pca$rotation[,4]
##        fixed.acidity     volatile.acidity          citric.acid 
##         -0.229617370          0.078959783         -0.079418256 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##         -0.372792562          0.666194756         -0.043537818 
## total.sulfur.dioxide              density                   pH 
##         -0.034577115         -0.174499758         -0.003787746 
##            sulphates              alcohol 
##          0.550872362         -0.122181088
  scatterplot(fourth_pa ~ wine$quality, 
              xlab="Wine quality", 
              ylab="Fourth PA ", 
              main="Wine quality vs fourth PA (axis of nothingburger)")

  lm(fourth_pa~wine$quality)
## 
## Call:
## lm(formula = fourth_pa ~ wine$quality)
## 
## Coefficients:
##  (Intercept)  wine$quality  
##      0.33946      -0.06023

Fifth principal axis

We’ll look at one more PA. This one seems to be characterized mostly by a lack of residual sugar, and, again, the effect on quality is minor.

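  # Column 5 holds the fifth PA scores. The original run indexed column 4, so the
  # lm() coefficients printed below repeat the fourth-PA fit and need regenerating.
  fifth_pa <- wine.pca$x[, 5]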
  wine.pca$rotation[,5]
##        fixed.acidity     volatile.acidity          citric.acid 
##           0.08261366          -0.21873452           0.05857268 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##          -0.73214429          -0.24650090           0.15915198 
## total.sulfur.dioxide              density                   pH 
##           0.22246456          -0.15707671          -0.26752977 
##            sulphates              alcohol 
##          -0.22596222          -0.35068141
  scatterplot(fifth_pa ~ wine$quality, 
              xlab="Wine quality", 
              ylab="Fifth PA ", 
              main="Wine quality vs fifth PA (axis of dryness)")

  lm(fifth_pa~wine$quality)
## 
## Call:
## lm(formula = fifth_pa ~ wine$quality)
## 
## Coefficients:
##  (Intercept)  wine$quality  
##      0.33946      -0.06023
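The per-axis regressions can also be run in a loop, which avoids repeating the column index by hand for each axis. This is a sketch only: `pa_slopes` is a hypothetical helper, and the data are synthetic (randomly generated, not wine measurements) so that the snippet runs without the CSV. With the real objects the call would be `pa_slopes(wine.pca, wine$quality)`.

```r
# Slope of each principal-component score regressed on quality,
# for the first n components.
pa_slopes <- function(pca, quality, n = 5) {
  n <- min(n, ncol(pca$x))
  sapply(seq_len(n), function(i) {
    unname(coef(lm(pca$x[, i] ~ quality))[2])
  })
}

# SYNTHETIC stand-in data, generated only to make the sketch runnable.
set.seed(1)
quality <- sample(3:8, 200, replace = TRUE)
toy <- data.frame(p1 = quality + rnorm(200),
                  p2 = rnorm(200),
                  p3 = rnorm(200))
toy.pca <- prcomp(scale(toy))
slopes <- pa_slopes(toy.pca, quality)
```

The result is one slope per component, which makes it easy to see at a glance which axes move with quality.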

LDA for red wine

Next is a linear discriminant analysis for the wine data. It should show the main axis that determines quality as a function of all the other properties.

It looks like LD1 accounts for most of the quality variation, but the scatterplot shows that there is too much overlap in the distributions to be able to reliably sort out any but the highest and lowest quality wines.

Although there is overlap in the distributions, this does look like a reasonable measure of quality.

One problem: the density coefficients are anomalously large. The model below is fit on the raw columns rather than the rescaled ones in s, and density barely varies (it is always close to 1), so it needs a very large coefficient to contribute at all.

  wine <- reds
  library('MASS')

  s <- as.data.frame(scale(wine[2:12]))
  wine.lda <- lda(wine$quality ~ wine$fixed.acidity + wine$volatile.acidity +
                    wine$citric.acid + wine$residual.sugar + wine$chlorides +
                    wine$free.sulfur.dioxide + wine$total.sulfur.dioxide +
                    wine$density + wine$pH + wine$sulphates + wine$alcohol)
  wine.lda
## Call:
## lda(wine$quality ~ wine$fixed.acidity + wine$volatile.acidity + 
##     wine$citric.acid + wine$residual.sugar + wine$chlorides + 
##     wine$free.sulfur.dioxide + wine$total.sulfur.dioxide + wine$density + 
##     wine$pH + wine$sulphates + wine$alcohol)
## 
## Prior probabilities of groups:
##           3           4           5           6           7           8 
## 0.006253909 0.033145716 0.425891182 0.398999375 0.124452783 0.011257036 
## 
## Group means:
##   wine$fixed.acidity wine$volatile.acidity wine$citric.acid
## 3           8.360000             0.8845000        0.1710000
## 4           7.779245             0.6939623        0.1741509
## 5           8.167254             0.5770411        0.2436858
## 6           8.347179             0.4974843        0.2738245
## 7           8.872362             0.4039196        0.3751759
## 8           8.566667             0.4233333        0.3911111
##   wine$residual.sugar wine$chlorides wine$free.sulfur.dioxide
## 3            2.635000     0.12250000                 11.00000
## 4            2.694340     0.09067925                 12.26415
## 5            2.528855     0.09273568                 16.98385
## 6            2.477194     0.08495611                 15.71160
## 7            2.720603     0.07658794                 14.04523
## 8            2.577778     0.06844444                 13.27778
##   wine$total.sulfur.dioxide wine$density  wine$pH wine$sulphates
## 3                  24.90000    0.9974640 3.398000      0.5700000
## 4                  36.24528    0.9965425 3.381509      0.5964151
## 5                  56.51395    0.9971036 3.304949      0.6209692
## 6                  40.86991    0.9966151 3.318072      0.6753292
## 7                  35.02010    0.9961043 3.290754      0.7412563
## 8                  33.44444    0.9952122 3.267222      0.7677778
##   wine$alcohol
## 3     9.955000
## 4    10.265094
## 5     9.899706
## 6    10.629519
## 7    11.465913
## 8    12.094444
## 
## Coefficients of linear discriminants:
##                                     LD1           LD2          LD3
## wine$fixed.acidity           0.15576218  -0.510826253  -0.13230726
## wine$volatile.acidity       -2.14869965  -5.169157664  -2.80464132
## wine$citric.acid            -0.24353923  -1.810902037  -3.67023468
## wine$residual.sugar          0.09907188  -0.310654752  -0.27785760
## wine$chlorides              -4.49075830  -3.286220068   4.88913726
## wine$free.sulfur.dioxide     0.01015280   0.002518588   0.05746815
## wine$total.sulfur.dioxide   -0.01066123   0.015340541  -0.02412087
## wine$density              -132.46861030 494.715527325 442.83751116
## wine$pH                     -0.27624041  -4.797254644   0.75289349
## wine$sulphates               2.55180806  -0.768377584  -0.57558078
## wine$alcohol                 0.67697595   0.270197247   0.18108052
##                                     LD4           LD5
## wine$fixed.acidity         -1.151995674  1.826028e-01
## wine$volatile.acidity       2.625991390 -2.404376e+00
## wine$citric.acid            1.097971759 -2.639100e+00
## wine$residual.sugar        -0.399628931  4.467281e-01
## wine$chlorides             -8.619928322 -7.425094e+00
## wine$free.sulfur.dioxide    0.020697814 -5.792837e-02
## wine$total.sulfur.dioxide  -0.009733169  7.784032e-03
## wine$density              569.905215399 -4.286164e+02
## wine$pH                    -8.470107031  2.487246e+00
## wine$sulphates              0.055588624  8.097894e-01
## wine$alcohol                0.800344973 -5.659672e-01
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4    LD5 
## 0.8496 0.1028 0.0333 0.0086 0.0056
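  # Predict on the training data. The formula refers to the wine$... columns
  # directly, so no newdata is needed (and s has no quality column to pass).
  wine.lda.values <- predict(wine.lda)
  first_lda <- wine.lda.values$x[,1]
  scatterplot(wine$quality, first_lda)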

  # ldahist(data = wine.lda.values$x[,1], g=wine$quality)
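The overlap between quality bins can be put into numbers by cross-tabulating the predicted class against the actual one. The sketch below also fits the model on standardized predictors, so the coefficients are on a comparable scale. It is an illustration under assumptions, not the analysis above: `lda_confusion` is a hypothetical helper, and the data are synthetic (randomly generated) so that the snippet runs without the CSV. On the real data the call would be something like `lda_confusion(reds[2:13], "quality")`.

```r
library(MASS)

# Fit LDA on standardized predictors and cross-tabulate predicted vs. actual class.
lda_confusion <- function(df, response) {
  predictors <- setdiff(names(df), response)
  scaled <- as.data.frame(scale(df[predictors]))
  scaled[[response]] <- factor(df[[response]])
  fit <- lda(reformulate(predictors, response), data = scaled)
  pred <- predict(fit, newdata = scaled)$class
  list(fit = fit,
       table = table(predicted = pred, actual = scaled[[response]]))
}

# SYNTHETIC stand-in data, generated only to make the sketch runnable.
set.seed(1)
q <- rep(c(5, 6, 7), each = 50)
toy <- data.frame(quality = q,
                  alcohol = q + rnorm(150),
                  sulphates = rnorm(150))
res <- lda_confusion(toy, "quality")

# Share of wines assigned to their actual quality bin (in-sample).
accuracy <- sum(diag(res$table)) / sum(res$table)
```

The off-diagonal counts in `res$table` show which neighboring bins get confused with each other. Note that this accuracy is measured on the same data the model was fit on, so it is optimistic.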

Final Plots and Summary

In the end, if the PCA has some validity, you can see that two of the principal differences in wine were less important in determining quality: the general acidity (PA1) and the residual sugar (PA5). But two others (PA2 and PA3) seem to be signatures of problems in winemaking. PA3, with its high volatile acidity and low alcohol, is probably wine going to vinegar. PA2 has some volatile acidity and low alcohol, but is mainly distinguished by high total sulfur dioxide. This can also lend an off-taste to wine.

Here is a useful summary of wine faults: https://wine.appstate.edu/sites/wine.appstate.edu/files/Chart%20Aromas%20FH_0.pdf

Note: I didn’t follow the instructions of presenting exploratory analysis. No, I didn’t read it in time and pretty much did the whole thing as a formal report. So, the whole thing up till now is “final plots and summary.”

This was way more work, but if I need to do a formal summary, please let me know.

Reflection

Tolstoy said that all happy families are alike, but each unhappy family is unhappy in its own way. Perhaps not true for families, but true enough for wine, at least for the wine in the top bins vs. the wines in the middle and lower bins.

What distinguishes the higher quality wine is the absence of wine faults. It is fairly easy to distinguish the best wine from the others by the absence of these faults. The PCA/LDA analysis shows some possible combinations of features that could be a signature of wine faults, but of course it’s just exploratory.

As for the wines in the middle quality bins (that is, the overwhelming majority of the wine samples), the picture becomes more hazy because there are many types of wine fault, so wines can be less-than-perfect in many different ways, in different degrees and different combinations. I’m not sure if the mid-quality wines are blended, in which case you’d expect wines with complementary faults to be mixed together, which would muddy the water still more.

Anyhow, nice project, more fun than I thought.